AITopics | Speech

Speech

Speech encompasses speech understanding/recognition and speech synthesis.

Google now uses your uploaded search media to train AI

Engadget | Jul-6-2026, 19:13:52 GMT

A few simple adjustments to your settings will opt you out. Google is at it again. The company recently, and quietly, changed how it hoovers up our data to train its AI platforms. It can now scoop up media you upload to its various search tools for training purposes, according to a report. This includes images, files, and audio and video recordings.

artificial intelligence, natural language, social media, (11 more...)

Engadget

Industry: Leisure & Entertainment > Games > Computer Games (0.77)

Technology:

Information Technology > Communications > Mobile (0.56)
Information Technology > Communications > Social Media (0.44)
Information Technology > Artificial Intelligence > Speech > Speech Recognition (0.44)
Information Technology > Artificial Intelligence > Natural Language (0.34)

Efficient Speech Language Modeling via Energy Distance in Continuous Latent Space

Neural Information Processing Systems | Jun-23-2026, 09:57:49 GMT

We introduce SLED, an alternative approach to speech language modeling by encoding speech waveforms into sequences of continuous latent representations and modeling them autoregressively using an energy distance objective. The energy distance offers an analytical measure of the distributional gap by contrasting simulated and target samples, enabling efficient training to capture the underlying continuous autoregressive distribution. By bypassing reliance on residual vector quantization, SLED avoids discretization errors and eliminates the need for the complicated hierarchical architectures common in existing speech language models.
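As a rough illustration of the quantity the abstract refers to, the sketch below computes the standard plug-in estimate of the energy distance, 2·E|X−Y| − E|X−X'| − E|Y−Y'|, between a set of simulated samples and a set of target samples. It is not SLED's training objective or code: the function names, the Gaussian toy data, and the sample sizes are all assumptions chosen for the example.

```python
import math
import random


def euclidean(a, b):
    """Euclidean distance between two equal-length vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def mean_pairwise_distance(xs, ys):
    """Average distance over all pairs (x, y) with x in xs and y in ys."""
    total = sum(euclidean(x, y) for x in xs for y in ys)
    return total / (len(xs) * len(ys))


def energy_distance(simulated, target):
    """Plug-in estimate of 2*E|X-Y| - E|X-X'| - E|Y-Y'|."""
    cross = mean_pairwise_distance(simulated, target)
    within_sim = mean_pairwise_distance(simulated, simulated)
    within_tgt = mean_pairwise_distance(target, target)
    return 2.0 * cross - within_sim - within_tgt


def gaussian_samples(rng, n, dim, mean):
    """Toy stand-in for latent vectors (hypothetical data, not speech latents)."""
    return [[rng.gauss(mean, 1.0) for _ in range(dim)] for _ in range(n)]


rng = random.Random(0)
target = gaussian_samples(rng, 50, 4, mean=0.0)
near = gaussian_samples(rng, 50, 4, mean=0.1)
far = gaussian_samples(rng, 50, 4, mean=3.0)

d_same = energy_distance(target, target)  # identical sample sets
d_near = energy_distance(near, target)    # slightly shifted distribution
d_far = energy_distance(far, target)      # strongly shifted distribution
print(d_same, d_near, d_far)
```

The estimate is zero when the two sample sets coincide and grows as the simulated samples drift away from the target, which is what makes it usable as a sample-based measure of distributional gap.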

artificial intelligence, machine learning, natural language, (20 more...)

Neural Information Processing Systems

Genre:

Research Report > New Finding (1.00)
Research Report > Experimental Study (1.00)

Industry: Information Technology (0.67)

Technology:

Information Technology > Artificial Intelligence > Speech > Speech Recognition (1.00)
Information Technology > Artificial Intelligence > Natural Language > Chatbot (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks (1.00)

Shallow Flow Matching for Coarse-to-Fine Text-to-Speech Synthesis

Neural Information Processing Systems | Jun-23-2026, 08:31:09 GMT

We propose Shallow Flow Matching (SFM), a novel mechanism that enhances flow matching (FM)-based text-to-speech (TTS) models within a coarse-to-fine generation paradigm. Unlike conventional FM modules, which use the coarse representations from the weak generator as conditions, SFM constructs intermediate states along the FM paths from these representations. During training, we introduce an orthogonal projection method to adaptively determine the temporal position of these states, and apply a principled construction strategy based on a single-segment piecewise flow. The SFM inference starts from the intermediate state rather than pure noise, thereby focusing computation on the latter stages of the FM paths. We integrate SFM into multiple TTS models with a lightweight SFM head. Experiments demonstrate that SFM yields consistent gains in speech naturalness across both objective and subjective evaluations, and significantly accelerates inference when using adaptive-step ODE solvers. Demo and code are available at https://ydqmkkx.github.io/SFMDemo/.

artificial intelligence, machine learning, speech synthesis, (17 more...)

Neural Information Processing Systems

Genre:

Research Report > Experimental Study (1.00)
Research Report > New Finding (0.87)

Technology:

Information Technology > Artificial Intelligence > Vision (1.00)
Information Technology > Artificial Intelligence > Speech > Speech Synthesis (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks (1.00)

Boosting

Neural Information Processing Systems | Jun-23-2026, 03:52:49 GMT

Attention-based encoder-decoder models remain a popular choice for state-of-the-art automatic speech recognition (ASR). These models combine a powerful audio encoder that extracts rich acoustic features with a decoder that autoregressively produces the ASR output. The decoder handles two critical tasks: (1) building rich text-only context and (2) merging acoustic information from the encoder to ensure the predictions remain faithful to the audio. We observe a systematic pattern across the attention distributions of decoder layers in prior architectures: the initial layers direct most attention towards building textual context, while the later layers largely focus on merging acoustic and textual information for the final predictions. Leveraging this key insight, we propose BLOCKDECODER, a novel decoder architecture comprising two distinct components: a text encoder that is purely text-based, and a MERGER that combines information from the audio encoder and text encoder to generate output tokens. Unlike traditional decoders, the MERGER autoregressively predicts a sequence of K tokens within a block of size K, while relying on the same precomputed contextual information from both text and audio encoders across the block. This design choice allows for the efficient reuse of encoder representations. The separation of the decoder into the text encoder and the MERGER promotes modularity and more flexible control of parameters via the number of text encoder and MERGER layers. As a result, BLOCKDECODER yields a significant speedup (2x) compared to traditional decoders, across diverse datasets, languages, and speech tasks, without any degradation in performance.

artificial intelligence, machine learning, natural language, (19 more...)

Neural Information Processing Systems

Country:

Europe (0.46)
Asia (0.46)

Genre: Research Report > Experimental Study (1.00)

Technology:

Information Technology > Artificial Intelligence > Speech > Speech Recognition (1.00)
Information Technology > Artificial Intelligence > Natural Language (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (0.46)

PhysioWave: A Multi-Scale Wavelet-Transformer for Physiological Signal Representation

Neural Information Processing Systems | Jun-23-2026, 03:23:25 GMT

Physiological signals are often corrupted by motion artifacts, baseline drift, and other low-SNR disturbances, which pose significant challenges for analysis. Additionally, these signals exhibit strong non-stationarity, with sharp peaks and abrupt changes that evolve continuously, making them difficult to represent using traditional time-domain or filtering methods. To address these issues, a novel wavelet-based approach for physiological signal analysis is presented, aiming to capture multi-scale time-frequency features in various physiological signals. Leveraging this technique, two large-scale pretrained models specific to EMG and ECG are introduced for the first time, achieving superior performance and setting new baselines in downstream tasks. Additionally, a unified multi-modal framework is constructed by integrating a pretrained EEG model, where each modality is guided through its dedicated branch and fused via learnable weighted fusion. This design effectively addresses challenges such as low signal-to-noise ratio, high inter-subject variability, and device mismatch, outperforming existing methods on multi-modal tasks. The proposed wavelet-based architecture lays a solid foundation for analysis of diverse physiological signals, while the multi-modal design points to next-generation physiological signal processing with potential impact on wearable health monitoring, clinical diagnostics, and broader biomedical applications.

data mining, machine learning, natural language, (21 more...)

Neural Information Processing Systems

Country: Europe (0.67)

Genre:

Research Report > New Finding (1.00)
Research Report > Experimental Study (1.00)

Industry:

Health & Medicine > Therapeutic Area > Cardiology/Vascular Diseases (1.00)
Health & Medicine > Health Care Technology (1.00)
Health & Medicine > Therapeutic Area > Neurology (0.68)
Health & Medicine > Diagnostic Medicine > Imaging (0.68)

Technology:

Information Technology > Data Science > Data Mining (1.00)
Information Technology > Artificial Intelligence > Vision (1.00)
Information Technology > Artificial Intelligence > Natural Language (1.00)
(7 more...)

FastLongSpeech: Enhancing Large Speech-Language Models for Efficient Long-Speech Processing

Neural Information Processing Systems | Jun-23-2026, 02:17:38 GMT

The rapid advancement of Large Language Models (LLMs) has spurred significant progress in Large Speech-Language Models (LSLMs), enhancing their capabilities in both speech understanding and generation. While existing LSLMs often concentrate on augmenting speech generation or tackling a diverse array of short-speech tasks, the efficient processing of long-form speech remains a critical yet underexplored challenge. This gap is primarily attributed to the scarcity of long-speech training datasets and the high computational costs associated with long sequences. To address these limitations, we introduce FastLongSpeech, a novel framework designed to extend LSLM capabilities for efficient long-speech processing without necessitating dedicated long-speech training data. FastLongSpeech incorporates an iterative fusion strategy that can compress excessively long-speech sequences into manageable lengths. To adapt LSLMs for long-speech inputs, it introduces a dynamic compression training approach, which exposes the model to short-speech sequences at varying compression ratios, thereby transferring the capabilities of LSLMs to long-speech tasks. To assess the long-speech capabilities of LSLMs, we develop a long-speech understanding benchmark called LongSpeech-Eval. Experiments show that our method exhibits strong performance in both long-speech and short-speech tasks, while greatly improving inference efficiency.

large language model, machine learning, natural language, (19 more...)

Neural Information Processing Systems

Country:

Asia (1.00)
Europe (0.67)

Genre: Research Report > Experimental Study (1.00)

Industry: Education (0.68)

Technology:

Information Technology > Artificial Intelligence > Speech > Speech Recognition (1.00)
Information Technology > Artificial Intelligence > Natural Language > Large Language Model (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (1.00)

Multi-head Temporal Latent Attention

Neural Information Processing Systems | Jun-23-2026, 01:57:06 GMT

While Transformer self-attention offers strong parallelism, the Key-Value (KV) cache grows linearly with sequence length and becomes a bottleneck for inference efficiency. Multi-head latent attention was recently developed to compress the KV cache into a low-rank latent space. This paper proposes Multi-head Temporal Latent Attention (MTLA), which further reduces the KV cache size along the temporal dimension, greatly lowering the memory footprint of self-attention inference. MTLA employs a hyper-network to dynamically merge temporally adjacent KV cache vectors. To address the mismatch between the compressed KV cache and processed sequence lengths, a stride-aware causal mask is proposed to ensure efficient parallel training and consistency with inference behaviour. Experiments across tasks, including speech translation, speech recognition, speech understanding and text summarisation, demonstrate that MTLA achieves competitive performance compared to standard Multi-Head Attention (MHA), while greatly improving inference speed and GPU memory usage. For example, on an English-German speech translation task, MTLA achieves a 5.3x speedup and a reduction in GPU memory usage by a factor of 8.3 compared to MHA, while maintaining translation quality.
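To make the idea of shrinking the cache along the temporal dimension concrete, here is a minimal sketch that merges each run of `stride` adjacent cache vectors into one. It is not the MTLA implementation: the abstract says the merge weights come from a hyper-network, whereas this example assumes fixed uniform weights by default, and the function name `merge_adjacent` and the toy vectors are hypothetical.

```python
def merge_adjacent(vectors, stride, weights=None):
    """Merge each run of `stride` adjacent vectors into one weighted sum.

    `weights` holds one scalar per input vector; when omitted, every group
    is averaged uniformly (an assumption standing in for learned weights).
    """
    if stride < 1:
        raise ValueError("stride must be at least 1")
    merged = []
    for start in range(0, len(vectors), stride):
        group = vectors[start:start + stride]
        if weights is None:
            group_weights = [1.0 / len(group)] * len(group)
        else:
            group_weights = weights[start:start + len(group)]
        dim = len(group[0])
        merged.append([
            sum(w * vec[d] for w, vec in zip(group_weights, group))
            for d in range(dim)
        ])
    return merged


# Five cached vectors of dimension 2, compressed with a temporal stride of 2.
cache = [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0], [7.0, 8.0], [9.0, 10.0]]
compressed = merge_adjacent(cache, stride=2)
print(len(cache), "->", len(compressed))
print(compressed)
```

With stride s, a sequence of T cached vectors shrinks to ceil(T / s) entries, so the stored cache no longer matches the processed sequence length one-to-one. That is the mismatch the abstract's stride-aware causal mask is said to address.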

artificial intelligence, machine learning, natural language, (21 more...)

Neural Information Processing Systems

Country:

North America > United States > Minnesota (0.28)
Europe > Austria (0.28)

Genre:

Research Report > New Finding (1.00)
Research Report > Experimental Study (1.00)

Technology:

Information Technology > Artificial Intelligence > Speech > Speech Recognition (1.00)
Information Technology > Artificial Intelligence > Natural Language (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (0.46)

RESPIN-S1.0: A read speech corpus of 10000+ hours in dialects of nine Indian Languages

Neural Information Processing Systems | Jun-22-2026, 22:18:19 GMT

Indian languages exhibit high dialectal variation and are spoken by populations that remain digitally underserved. Existing speech corpora typically represent only standard dialects and lack domain and linguistic diversity.

machine learning, natural language, utterance, (22 more...)

Neural Information Processing Systems

Country: Asia > India (1.00)

Genre:

Workflow (0.93)
Overview (0.67)

Industry:

Media (1.00)
Food & Agriculture > Agriculture (1.00)
Banking & Finance (1.00)
(3 more...)

Technology:

Information Technology > Communications > Social Media (1.00)
Information Technology > Artificial Intelligence > Speech > Speech Recognition (1.00)
Information Technology > Artificial Intelligence > Natural Language > Text Processing (1.00)
(3 more...)

MRSAudio: A Large-Scale Multimodal Recorded Spatial Audio Dataset with Refined Annotations

Neural Information Processing Systems | Jun-22-2026, 22:03:16 GMT

Humans rely on multisensory integration to perceive spatial environments, where auditory cues enable sound source localization in three-dimensional space. Despite the critical role of spatial audio in immersive technologies such as VR/AR, most existing multimodal datasets provide only monaural audio, which limits the development of spatial audio generation and understanding. To address these challenges, we introduce MRSAudio, a large-scale multimodal spatial audio dataset designed to advance research in spatial audio understanding and generation.

large language model, machine learning, natural language, (21 more...)

Neural Information Processing Systems

Country: Asia (0.28)

Genre: Research Report > Experimental Study (1.00)

Industry:

Media > Music (1.00)
Leisure & Entertainment (1.00)

Technology:

Information Technology > Communications (1.00)
Information Technology > Artificial Intelligence > Vision (1.00)
Information Technology > Artificial Intelligence > Representation & Reasoning (1.00)
(4 more...)

Watch and Listen: Understanding Audio-Visual-Speech Moments with Multimodal LLM

Neural Information Processing Systems | Jun-22-2026, 10:27:57 GMT

[Figure: example queries asking the model to locate the moment or timestamp at which a described audio-visual-speech event occurs.]

caption, large language model, machine learning, (19 more...)

Neural Information Processing Systems

Country: